During the pandemic years, credit card usage in the United States changed considerably. Average credit card debt grew by 52% between 2018 and 2019; however, this percentage fell significantly in 2020 (ValuePenguin; Household Debt and Credit Report, New York Fed).
The United States carries 807 billion dollars of credit card debt spread across 506 million credit cards, while the average debt of an American household is 6,270 dollars. These numbers rose every year until 2020, when they decreased because of COVID-19. In 2020 the average debt per household fell to 5,315 dollars, a drop of almost 1,000 dollars. The pandemic had a large impact on everyone's life, and the resulting changes help explain this shift in debt. Among them: people spent less money on purchases and expenses, and many received supplementary income, through unemployment benefits or for other reasons, that they could put toward these debts (Resendiz, 2021).
To learn more about the possible causes of the decrease in debt, we also looked into different ways of resolving this kind of financial situation. The first is through debt settlement companies. These companies offer to negotiate with credit card issuers so that the debtor can reduce the amount owed. The debtor should keep in mind that the negotiation can take time and that the settlement company must be paid. The debtor should also beware of fraudulent companies that will simply leave them in a worse position (Comision Federal de Comercio).
Another option is for the person in default to negotiate directly with the creditors. They may be granted a lower interest rate, which makes the debt easier to pay. A lower rate means that a smaller share of each monthly payment is consumed by accrued interest charges, so the debt can be paid off faster. Maximizing cash flow is also crucial: whether by finding new or better jobs or by cutting expenses, every amount helps. Finally, the debtor should organize and prioritize their debts. Reviewing and ordering all outstanding debts helps to develop a plan that sets out how, and in what order, they will be paid (Debt).
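To make the effect of a lower rate concrete, here is a minimal sketch based on the standard fixed-payment amortization formula. Only the 6,270-dollar balance comes from the figures above; the 200-dollar monthly payment and the two annual rates are hypothetical values chosen for illustration.

```python
import math


def months_to_payoff(balance: float, apr: float, payment: float) -> int:
    """Number of fixed monthly payments needed to clear a balance.

    Uses n = -ln(1 - B*r/P) / ln(1 + r), where r is the monthly rate.
    """
    monthly_rate = apr / 12
    if monthly_rate == 0:
        return math.ceil(balance / payment)
    if payment <= balance * monthly_rate:
        raise ValueError("payment does not cover the monthly interest")
    n = -math.log(1 - balance * monthly_rate / payment) / math.log(1 + monthly_rate)
    return math.ceil(n)


balance = 6270.0  # average household debt quoted in the text
payment = 200.0   # hypothetical monthly payment

for apr in (0.24, 0.15):  # hypothetical annual rates
    first_month_interest = balance * apr / 12
    months = months_to_payoff(balance, apr, payment)
    print(f"APR {apr:.0%}: first-month interest {first_month_interest:.2f}, "
          f"months to pay off: {months}")
```

With the same payment, the lower rate leaves more of each installment to reduce principal, which is why the payoff horizon shortens.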
These figures and solutions depend heavily on the individuals who hold the debt. For that reason it is important to know which characteristics describe a customer who defaults on a credit card payment. The goal of this project is to build machine learning models that predict which kinds of customers are likely to default and which are not. For this we use a dataset containing information on payment defaults, demographic factors, credit data, payment history, and account statements of credit card customers in Taiwan from April 2005 to September 2005.
General:
Specific:
Description of the dataset as provided:
There are 25 variables (22 quantitative, 3 categorical):
import pandas as pd
import numpy as np
from pandas_profiling import ProfileReport
import seaborn as sn
import matplotlib.pyplot as plt
from quickda.clean_data import *
from quickda.explore_data import *
df = pd.read_csv("UCI_Credit_Card.csv")
df.head()
| | ID | LIMIT_BAL | SEX | EDUCATION | MARRIAGE | AGE | PAY_0 | PAY_2 | PAY_3 | PAY_4 | ... | BILL_AMT4 | BILL_AMT5 | BILL_AMT6 | PAY_AMT1 | PAY_AMT2 | PAY_AMT3 | PAY_AMT4 | PAY_AMT5 | PAY_AMT6 | default.payment.next.month |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 20000.0 | 2 | 2 | 1 | 24 | 2 | 2 | -1 | -1 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 689.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1 |
| 1 | 2 | 120000.0 | 2 | 2 | 2 | 26 | -1 | 2 | 0 | 0 | ... | 3272.0 | 3455.0 | 3261.0 | 0.0 | 1000.0 | 1000.0 | 1000.0 | 0.0 | 2000.0 | 1 |
| 2 | 3 | 90000.0 | 2 | 2 | 2 | 34 | 0 | 0 | 0 | 0 | ... | 14331.0 | 14948.0 | 15549.0 | 1518.0 | 1500.0 | 1000.0 | 1000.0 | 1000.0 | 5000.0 | 0 |
| 3 | 4 | 50000.0 | 2 | 2 | 1 | 37 | 0 | 0 | 0 | 0 | ... | 28314.0 | 28959.0 | 29547.0 | 2000.0 | 2019.0 | 1200.0 | 1100.0 | 1069.0 | 1000.0 | 0 |
| 4 | 5 | 50000.0 | 1 | 2 | 1 | 57 | -1 | 0 | -1 | 0 | ... | 20940.0 | 19146.0 | 19131.0 | 2000.0 | 36681.0 | 10000.0 | 9000.0 | 689.0 | 679.0 | 0 |
5 rows × 25 columns
explore(df, method="summarize")
| | dtypes | count | null_sum | null_pct | nunique | min | 25% | 50% | 75% | max | mean | median | std | skew |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ID | int64 | 30000 | 0 | 0.0 | 30000 | 1.0 | 7500.75 | 15000.5 | 22500.25 | 30000.0 | 15000.500 | 15000.5 | 8660.398 | 0.000 |
| LIMIT_BAL | float64 | 30000 | 0 | 0.0 | 81 | 10000.0 | 50000.00 | 140000.0 | 240000.00 | 1000000.0 | 167484.323 | 140000.0 | 129747.662 | 0.993 |
| SEX | int64 | 30000 | 0 | 0.0 | 2 | 1.0 | 1.00 | 2.0 | 2.00 | 2.0 | 1.604 | 2.0 | 0.489 | -0.424 |
| EDUCATION | int64 | 30000 | 0 | 0.0 | 7 | 0.0 | 1.00 | 2.0 | 2.00 | 6.0 | 1.853 | 2.0 | 0.790 | 0.971 |
| MARRIAGE | int64 | 30000 | 0 | 0.0 | 4 | 0.0 | 1.00 | 2.0 | 2.00 | 3.0 | 1.552 | 2.0 | 0.522 | -0.019 |
| AGE | int64 | 30000 | 0 | 0.0 | 56 | 21.0 | 28.00 | 34.0 | 41.00 | 79.0 | 35.486 | 34.0 | 9.218 | 0.732 |
| PAY_0 | int64 | 30000 | 0 | 0.0 | 11 | -2.0 | -1.00 | 0.0 | 0.00 | 8.0 | -0.017 | 0.0 | 1.124 | 0.732 |
| PAY_2 | int64 | 30000 | 0 | 0.0 | 11 | -2.0 | -1.00 | 0.0 | 0.00 | 8.0 | -0.134 | 0.0 | 1.197 | 0.791 |
| PAY_3 | int64 | 30000 | 0 | 0.0 | 11 | -2.0 | -1.00 | 0.0 | 0.00 | 8.0 | -0.166 | 0.0 | 1.197 | 0.841 |
| PAY_4 | int64 | 30000 | 0 | 0.0 | 11 | -2.0 | -1.00 | 0.0 | 0.00 | 8.0 | -0.221 | 0.0 | 1.169 | 1.000 |
| PAY_5 | int64 | 30000 | 0 | 0.0 | 10 | -2.0 | -1.00 | 0.0 | 0.00 | 8.0 | -0.266 | 0.0 | 1.133 | 1.008 |
| PAY_6 | int64 | 30000 | 0 | 0.0 | 10 | -2.0 | -1.00 | 0.0 | 0.00 | 8.0 | -0.291 | 0.0 | 1.150 | 0.948 |
| BILL_AMT1 | float64 | 30000 | 0 | 0.0 | 22723 | -165580.0 | 3558.75 | 22381.5 | 67091.00 | 964511.0 | 51223.331 | 22381.5 | 73635.861 | 2.664 |
| BILL_AMT2 | float64 | 30000 | 0 | 0.0 | 22346 | -69777.0 | 2984.75 | 21200.0 | 64006.25 | 983931.0 | 49179.075 | 21200.0 | 71173.769 | 2.705 |
| BILL_AMT3 | float64 | 30000 | 0 | 0.0 | 22026 | -157264.0 | 2666.25 | 20088.5 | 60164.75 | 1664089.0 | 47013.155 | 20088.5 | 69349.387 | 3.088 |
| BILL_AMT4 | float64 | 30000 | 0 | 0.0 | 21548 | -170000.0 | 2326.75 | 19052.0 | 54506.00 | 891586.0 | 43262.949 | 19052.0 | 64332.856 | 2.822 |
| BILL_AMT5 | float64 | 30000 | 0 | 0.0 | 21010 | -81334.0 | 1763.00 | 18104.5 | 50190.50 | 927171.0 | 40311.401 | 18104.5 | 60797.156 | 2.876 |
| BILL_AMT6 | float64 | 30000 | 0 | 0.0 | 20604 | -339603.0 | 1256.00 | 17071.0 | 49198.25 | 961664.0 | 38871.760 | 17071.0 | 59554.108 | 2.847 |
| PAY_AMT1 | float64 | 30000 | 0 | 0.0 | 7943 | 0.0 | 1000.00 | 2100.0 | 5006.00 | 873552.0 | 5663.580 | 2100.0 | 16563.280 | 14.668 |
| PAY_AMT2 | float64 | 30000 | 0 | 0.0 | 7899 | 0.0 | 833.00 | 2009.0 | 5000.00 | 1684259.0 | 5921.164 | 2009.0 | 23040.870 | 30.454 |
| PAY_AMT3 | float64 | 30000 | 0 | 0.0 | 7518 | 0.0 | 390.00 | 1800.0 | 4505.00 | 896040.0 | 5225.682 | 1800.0 | 17606.961 | 17.217 |
| PAY_AMT4 | float64 | 30000 | 0 | 0.0 | 6937 | 0.0 | 296.00 | 1500.0 | 4013.25 | 621000.0 | 4826.077 | 1500.0 | 15666.160 | 12.905 |
| PAY_AMT5 | float64 | 30000 | 0 | 0.0 | 6897 | 0.0 | 252.50 | 1500.0 | 4031.50 | 426529.0 | 4799.388 | 1500.0 | 15278.306 | 11.127 |
| PAY_AMT6 | float64 | 30000 | 0 | 0.0 | 6939 | 0.0 | 117.75 | 1500.0 | 4000.00 | 528666.0 | 5215.503 | 1500.0 | 17777.466 | 10.641 |
| default.payment.next.month | int64 | 30000 | 0 | 0.0 | 2 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 | 0.221 | 0.0 | 0.415 | 1.344 |
profile = ProfileReport(df, minimal=True)
profile
df = clean(df, method = "standardize")
df.head()
| | id | limit_bal | sex | education | marriage | age | pay_0 | pay_2 | pay_3 | pay_4 | ... | bill_amt4 | bill_amt5 | bill_amt6 | pay_amt1 | pay_amt2 | pay_amt3 | pay_amt4 | pay_amt5 | pay_amt6 | default.payment.next.month |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 20000.0 | 2 | 2 | 1 | 24 | 2 | 2 | -1 | -1 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 689.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1 |
| 1 | 2 | 120000.0 | 2 | 2 | 2 | 26 | -1 | 2 | 0 | 0 | ... | 3272.0 | 3455.0 | 3261.0 | 0.0 | 1000.0 | 1000.0 | 1000.0 | 0.0 | 2000.0 | 1 |
| 2 | 3 | 90000.0 | 2 | 2 | 2 | 34 | 0 | 0 | 0 | 0 | ... | 14331.0 | 14948.0 | 15549.0 | 1518.0 | 1500.0 | 1000.0 | 1000.0 | 1000.0 | 5000.0 | 0 |
| 3 | 4 | 50000.0 | 2 | 2 | 1 | 37 | 0 | 0 | 0 | 0 | ... | 28314.0 | 28959.0 | 29547.0 | 2000.0 | 2019.0 | 1200.0 | 1100.0 | 1069.0 | 1000.0 | 0 |
| 4 | 5 | 50000.0 | 1 | 2 | 1 | 57 | -1 | 0 | -1 | 0 | ... | 20940.0 | 19146.0 | 19131.0 | 2000.0 | 36681.0 | 10000.0 | 9000.0 | 689.0 | 679.0 | 0 |
5 rows × 25 columns
to_categoric = ["sex", "education", "marriage",
"pay_0", "pay_2","pay_3", "pay_4", "pay_5","pay_6","default.payment.next.month"]
df = clean(df, method = 'dtypes', columns = to_categoric,
dtype='category')
df = df.rename(columns = {'pay_0': 'pay_1',}, inplace = False)
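For readers who do not have quickda installed, the type conversion and rename above can be approximated in plain pandas. This is only a sketch on a toy frame with made-up values; it assumes that `clean(..., method='dtypes')` does nothing more than cast the listed columns.

```python
import pandas as pd

# Toy frame standing in for the real data (values are illustrative only)
toy = pd.DataFrame({
    "sex": [1, 2, 2],
    "pay_0": [-1, 0, 2],
    "limit_bal": [20000.0, 120000.0, 90000.0],
})

# Cast the coded columns to the pandas 'category' dtype
to_categoric = ["sex", "pay_0"]
toy = toy.astype({col: "category" for col in to_categoric})

# Rename pay_0 so the payment-status columns are numbered consistently
toy = toy.rename(columns={"pay_0": "pay_1"})

print(toy.dtypes)
```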
explore(df, method="summarize")
C:\Users\jdieg\AppData\Local\Programs\Python\Python39\lib\site-packages\quickda\explore_data.py:26: FutureWarning: Dropping of nuisance columns in DataFrame reductions (with 'numeric_only=None') is deprecated; in a future version this will raise TypeError. Select only valid columns before calling the reduction. C:\Users\jdieg\AppData\Local\Programs\Python\Python39\lib\site-packages\quickda\explore_data.py:30: FutureWarning: Dropping of nuisance columns in DataFrame reductions (with 'numeric_only=None') is deprecated; in a future version this will raise TypeError. Select only valid columns before calling the reduction.
| | dtypes | count | null_sum | null_pct | nunique | min | 25% | 50% | 75% | max | mean | median | std | skew |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| age | int64 | 30000 | 0 | 0.0 | 56 | 21.0 | 28.0 | 34.0 | 41.0 | 79.0 | 35.4855 | 34.0 | 9.217904 | 0.732246 |
| bill_amt1 | float64 | 30000 | 0 | 0.0 | 22723 | -165580.0 | 3558.75 | 22381.5 | 67091.0 | 964511.0 | 51223.3309 | 22381.5 | 73635.860576 | 2.663861 |
| bill_amt2 | float64 | 30000 | 0 | 0.0 | 22346 | -69777.0 | 2984.75 | 21200.0 | 64006.25 | 983931.0 | 49179.075167 | 21200.0 | 71173.768783 | 2.705221 |
| bill_amt3 | float64 | 30000 | 0 | 0.0 | 22026 | -157264.0 | 2666.25 | 20088.5 | 60164.75 | 1664089.0 | 47013.1548 | 20088.5 | 69349.387427 | 3.08783 |
| bill_amt4 | float64 | 30000 | 0 | 0.0 | 21548 | -170000.0 | 2326.75 | 19052.0 | 54506.0 | 891586.0 | 43262.948967 | 19052.0 | 64332.856134 | 2.821965 |
| bill_amt5 | float64 | 30000 | 0 | 0.0 | 21010 | -81334.0 | 1763.0 | 18104.5 | 50190.5 | 927171.0 | 40311.400967 | 18104.5 | 60797.15577 | 2.87638 |
| bill_amt6 | float64 | 30000 | 0 | 0.0 | 20604 | -339603.0 | 1256.0 | 17071.0 | 49198.25 | 961664.0 | 38871.7604 | 17071.0 | 59554.107537 | 2.846645 |
| default.payment.next.month | category | 30000 | 0 | 0.0 | 2 | - | - | - | - | - | - | - | - | - |
| education | category | 30000 | 0 | 0.0 | 7 | - | - | - | - | - | - | - | - | - |
| id | int64 | 30000 | 0 | 0.0 | 30000 | 1.0 | 7500.75 | 15000.5 | 22500.25 | 30000.0 | 15000.5 | 15000.5 | 8660.398374 | 0.0 |
| limit_bal | float64 | 30000 | 0 | 0.0 | 81 | 10000.0 | 50000.0 | 140000.0 | 240000.0 | 1000000.0 | 167484.322667 | 140000.0 | 129747.661567 | 0.992867 |
| marriage | category | 30000 | 0 | 0.0 | 4 | - | - | - | - | - | - | - | - | - |
| pay_1 | category | 30000 | 0 | 0.0 | 11 | - | - | - | - | - | - | - | - | - |
| pay_2 | category | 30000 | 0 | 0.0 | 11 | - | - | - | - | - | - | - | - | - |
| pay_3 | category | 30000 | 0 | 0.0 | 11 | - | - | - | - | - | - | - | - | - |
| pay_4 | category | 30000 | 0 | 0.0 | 11 | - | - | - | - | - | - | - | - | - |
| pay_5 | category | 30000 | 0 | 0.0 | 10 | - | - | - | - | - | - | - | - | - |
| pay_6 | category | 30000 | 0 | 0.0 | 10 | - | - | - | - | - | - | - | - | - |
| pay_amt1 | float64 | 30000 | 0 | 0.0 | 7943 | 0.0 | 1000.0 | 2100.0 | 5006.0 | 873552.0 | 5663.5805 | 2100.0 | 16563.280354 | 14.668364 |
| pay_amt2 | float64 | 30000 | 0 | 0.0 | 7899 | 0.0 | 833.0 | 2009.0 | 5000.0 | 1684259.0 | 5921.1635 | 2009.0 | 23040.870402 | 30.453817 |
| pay_amt3 | float64 | 30000 | 0 | 0.0 | 7518 | 0.0 | 390.0 | 1800.0 | 4505.0 | 896040.0 | 5225.6815 | 1800.0 | 17606.96147 | 17.216635 |
| pay_amt4 | float64 | 30000 | 0 | 0.0 | 6937 | 0.0 | 296.0 | 1500.0 | 4013.25 | 621000.0 | 4826.076867 | 1500.0 | 15666.159744 | 12.904985 |
| pay_amt5 | float64 | 30000 | 0 | 0.0 | 6897 | 0.0 | 252.5 | 1500.0 | 4031.5 | 426529.0 | 4799.387633 | 1500.0 | 15278.305679 | 11.127417 |
| pay_amt6 | float64 | 30000 | 0 | 0.0 | 6939 | 0.0 | 117.75 | 1500.0 | 4000.0 | 528666.0 | 5215.502567 | 1500.0 | 17777.465775 | 10.640727 |
| sex | category | 30000 | 0 | 0.0 | 2 | - | - | - | - | - | - | - | - | - |
df = clean(df, method = 'dropcols', columns = ['id'])
df = clean(df, method = "replaceval",
columns = ["education"],
to_replace = [0,5,6],
value = 5)
df = clean(df, method = "replaceval",
columns = ["marriage"],
to_replace = [0],
value = 3)
These values do carry meaning in the dataset with respect to how payments were made, so they cannot be replaced with nulls.
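The two `replaceval` calls above merge several codes into one. A plain-pandas sketch of the same recoding, on a toy frame with illustrative values, might look as follows; it assumes quickda's `replaceval` behaves like `Series.replace`, and it uses integer columns rather than the categorical ones in the notebook.

```python
import pandas as pd

# Toy frame standing in for the real data (values are illustrative only)
toy = pd.DataFrame({
    "education": [0, 1, 2, 3, 4, 5, 6],
    "marriage": [0, 1, 2, 3, 1, 2, 3],
})

# Collapse education codes 0, 5 and 6 into the single code 5
toy["education"] = toy["education"].replace([0, 5, 6], 5)

# Fold marriage code 0 into code 3
toy["marriage"] = toy["marriage"].replace(0, 3)

print(sorted(toy["education"].unique()))
print(sorted(toy["marriage"].unique()))
```

After this step `education` has five distinct codes and `marriage` has three, which matches the `nunique` values reported in the later summaries.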
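# Correlate numeric columns only; categorical columns are excluded explicitly
correlation = df.select_dtypes(include="number").corr()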
plt.figure(figsize = (13, 10))
sn.heatmap(correlation, annot=True)
plt.show()
# Work on a copy so the engineered column is not written back into df
dfSimp = df.copy()
bill_cols = ["bill_amt1", "bill_amt2", "bill_amt3", "bill_amt4", "bill_amt5", "bill_amt6"]
# Average statement balance across the six months
dfSimp["prom_bill_amt"] = dfSimp[bill_cols].mean(axis=1)
dfSimp = clean(dfSimp, method='dropcols', columns=bill_cols)
cols = ['limit_bal', 'sex','education','marriage','age','pay_1','pay_2','pay_3',
'pay_4','pay_5','pay_6','prom_bill_amt','pay_amt1','pay_amt2','pay_amt3','pay_amt4','pay_amt5',
'pay_amt6','default.payment.next.month']
dfSimp = dfSimp[cols]
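# Correlate numeric columns only; categorical columns are excluded explicitly
correlation = dfSimp.select_dtypes(include="number").corr()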
plt.figure(figsize = (13, 10))
sn.heatmap(correlation, annot=True)
plt.show()
explore(dfSimp, method="summarize")
| | dtypes | count | null_sum | null_pct | nunique | min | 25% | 50% | 75% | max | mean | median | std | skew |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| age | int64 | 30000 | 0 | 0.0 | 56 | 21.0 | 28.0 | 34.0 | 41.0 | 79.0 | 35.4855 | 34.0 | 9.217904 | 0.732246 |
| default.payment.next.month | category | 30000 | 0 | 0.0 | 2 | - | - | - | - | - | - | - | - | - |
| education | category | 30000 | 0 | 0.0 | 5 | - | - | - | - | - | - | - | - | - |
| limit_bal | float64 | 30000 | 0 | 0.0 | 81 | 10000.0 | 50000.0 | 140000.0 | 240000.0 | 1000000.0 | 167484.322667 | 140000.0 | 129747.661567 | 0.992867 |
| marriage | category | 30000 | 0 | 0.0 | 3 | - | - | - | - | - | - | - | - | - |
| pay_1 | category | 30000 | 0 | 0.0 | 11 | - | - | - | - | - | - | - | - | - |
| pay_2 | category | 30000 | 0 | 0.0 | 11 | - | - | - | - | - | - | - | - | - |
| pay_3 | category | 30000 | 0 | 0.0 | 11 | - | - | - | - | - | - | - | - | - |
| pay_4 | category | 30000 | 0 | 0.0 | 11 | - | - | - | - | - | - | - | - | - |
| pay_5 | category | 30000 | 0 | 0.0 | 10 | - | - | - | - | - | - | - | - | - |
| pay_6 | category | 30000 | 0 | 0.0 | 10 | - | - | - | - | - | - | - | - | - |
| pay_amt1 | float64 | 30000 | 0 | 0.0 | 7943 | 0.0 | 1000.0 | 2100.0 | 5006.0 | 873552.0 | 5663.5805 | 2100.0 | 16563.280354 | 14.668364 |
| pay_amt2 | float64 | 30000 | 0 | 0.0 | 7899 | 0.0 | 833.0 | 2009.0 | 5000.0 | 1684259.0 | 5921.1635 | 2009.0 | 23040.870402 | 30.453817 |
| pay_amt3 | float64 | 30000 | 0 | 0.0 | 7518 | 0.0 | 390.0 | 1800.0 | 4505.0 | 896040.0 | 5225.6815 | 1800.0 | 17606.96147 | 17.216635 |
| pay_amt4 | float64 | 30000 | 0 | 0.0 | 6937 | 0.0 | 296.0 | 1500.0 | 4013.25 | 621000.0 | 4826.076867 | 1500.0 | 15666.159744 | 12.904985 |
| pay_amt5 | float64 | 30000 | 0 | 0.0 | 6897 | 0.0 | 252.5 | 1500.0 | 4031.5 | 426529.0 | 4799.387633 | 1500.0 | 15278.305679 | 11.127417 |
| pay_amt6 | float64 | 30000 | 0 | 0.0 | 6939 | 0.0 | 117.75 | 1500.0 | 4000.0 | 528666.0 | 5215.502567 | 1500.0 | 17777.465775 | 10.640727 |
| prom_bill_amt | float64 | 30000 | 0 | 0.0 | 27370 | -56043.166667 | 4781.333333 | 21051.833333 | 57104.416667 | 877313.833333 | 44976.9452 | 21051.833333 | 63260.72186 | 2.734744 |
| sex | category | 30000 | 0 | 0.0 | 2 | - | - | - | - | - | - | - | - | - |
# Plot a histogram for each numeric column of the dataset
dfSimp.hist(figsize=(30, 25))
[Figure: histograms of limit_bal, age, prom_bill_amt, pay_amt1, pay_amt2, pay_amt3, pay_amt4, pay_amt5, pay_amt6]
dfSimp = clean(dfSimp, method='outliers', columns=["pay_amt1","pay_amt3","pay_amt5"])
# Renumber the rows after dropping outliers, without hard-coding the row count
dfSimp = dfSimp.reset_index(drop=True)
explore(dfSimp, method="summarize")
| | dtypes | count | null_sum | null_pct | nunique | min | 25% | 50% | 75% | max | mean | median | std | skew |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| age | int64 | 23241 | 0 | 0.0 | 54 | 21.0 | 28.0 | 34.0 | 41.0 | 75.0 | 35.303903 | 34.0 | 9.394103 | 0.738224 |
| default.payment.next.month | category | 23241 | 0 | 0.0 | 2 | - | - | - | - | - | - | - | - | - |
| education | category | 23241 | 0 | 0.0 | 5 | - | - | - | - | - | - | - | - | - |
| limit_bal | float64 | 23241 | 0 | 0.0 | 73 | 10000.0 | 50000.0 | 110000.0 | 200000.0 | 800000.0 | 143288.670883 | 110000.0 | 115338.309588 | 1.167094 |
| marriage | category | 23241 | 0 | 0.0 | 3 | - | - | - | - | - | - | - | - | - |
| pay_1 | category | 23241 | 0 | 0.0 | 11 | - | - | - | - | - | - | - | - | - |
| pay_2 | category | 23241 | 0 | 0.0 | 11 | - | - | - | - | - | - | - | - | - |
| pay_3 | category | 23241 | 0 | 0.0 | 11 | - | - | - | - | - | - | - | - | - |
| pay_4 | category | 23241 | 0 | 0.0 | 10 | - | - | - | - | - | - | - | - | - |
| pay_5 | category | 23241 | 0 | 0.0 | 10 | - | - | - | - | - | - | - | - | - |
| pay_6 | category | 23241 | 0 | 0.0 | 10 | - | - | - | - | - | - | - | - | - |
| pay_amt1 | float64 | 23241 | 0 | 0.0 | 4975 | 0.0 | 403.0 | 1950.0 | 3500.0 | 11012.0 | 2457.650876 | 1950.0 | 2392.254303 | 1.239643 |
| pay_amt2 | float64 | 23241 | 0 | 0.0 | 5412 | 0.0 | 400.0 | 1865.0 | 3585.0 | 385228.0 | 3596.274171 | 1865.0 | 10632.681991 | 13.201833 |
| pay_amt3 | float64 | 23241 | 0 | 0.0 | 4517 | 0.0 | 150.0 | 1376.0 | 3000.0 | 9072.0 | 1902.397659 | 1376.0 | 1977.181619 | 1.229872 |
| pay_amt4 | float64 | 23241 | 0 | 0.0 | 4831 | 0.0 | 90.0 | 1100.0 | 3000.0 | 256662.0 | 3057.287208 | 1100.0 | 9855.723264 | 10.867168 |
| pay_amt5 | float64 | 23241 | 0 | 0.0 | 4029 | 0.0 | 0.0 | 1001.0 | 2600.0 | 7464.0 | 1657.465384 | 1001.0 | 1783.749048 | 1.143616 |
| pay_amt6 | float64 | 23241 | 0 | 0.0 | 4781 | 0.0 | 0.0 | 1041.0 | 3000.0 | 528666.0 | 3269.936492 | 1041.0 | 12622.460778 | 14.388722 |
| prom_bill_amt | float64 | 23241 | 0 | 0.0 | 20895 | -56043.166667 | 2789.833333 | 18756.666667 | 48071.666667 | 456957.5 | 34871.408301 | 18756.666667 | 43649.553985 | 1.8924 |
| sex | category | 23241 | 0 | 0.0 | 2 | - | - | - | - | - | - | - | - | - |
After removing some outliers we are left with 23241 observations, about 77% of the original dataset, which we still consider a good number of records to work with.
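The exact rule that quickda applies for `method='outliers'` is not shown in this notebook. A common choice is the 1.5 × IQR fence, and the post-cleaning maximum of `pay_amt1` (11012) is at least consistent with the upper fence computed from the original quartiles (5006 + 1.5 × 4006 = 11015). The sketch below implements that rule under this assumption; `drop_iqr_outliers` is a hypothetical helper, and the toy data is illustrative.

```python
import pandas as pd


def drop_iqr_outliers(frame: pd.DataFrame, columns, k: float = 1.5) -> pd.DataFrame:
    """Keep rows inside [Q1 - k*IQR, Q3 + k*IQR] for every listed column.

    Fences are computed on the full frame for each column; a library may
    instead filter column by column, which can give different results.
    """
    mask = pd.Series(True, index=frame.index)
    for col in columns:
        q1, q3 = frame[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        mask &= frame[col].between(q1 - k * iqr, q3 + k * iqr)
    return frame[mask].reset_index(drop=True)


# Toy data: one extreme payment in pay_amt1, none in pay_amt3
toy = pd.DataFrame({
    "pay_amt1": [0, 100, 200, 300, 400, 50_000],
    "pay_amt3": [10, 20, 30, 40, 50, 60],
})
trimmed = drop_iqr_outliers(toy, ["pay_amt1", "pay_amt3"])
print(len(toy), "rows before,", len(trimmed), "rows after")
```

Note that only three of the six payment columns were filtered in the notebook, so `pay_amt2`, `pay_amt4` and `pay_amt6` keep their long right tails, as the skew values in the summary show.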
# Plot a histogram for each numeric column of the dataset
dfSimp.hist(figsize=(30, 25))
[Figure: histograms of limit_bal, age, prom_bill_amt, pay_amt1, pay_amt2, pay_amt3, pay_amt4, pay_amt5, pay_amt6 after outlier removal]
dfSimp['education'].value_counts().plot(kind='bar', title = "Education")
[Figure: bar chart "Education"]
dfSimp['marriage'].value_counts().plot(kind='bar', title = "Marital status")
[Figure: bar chart "Marital status"]
dfSimp['sex'].value_counts().plot(kind='bar', title = "Sex")
[Figure: bar chart "Sex"]
dfSimp['default.payment.next.month'].value_counts().plot(kind='bar', title = "Default")
[Figure: bar chart "Default"]
pd.crosstab(dfSimp.education, dfSimp['default.payment.next.month']).plot(kind="bar",figsize=(15,6), title = "Payment analysis by education")
plt.xlabel('Education')
plt.xticks(rotation=0)
plt.legend(["Paid", "Default"])
plt.show()
pd.crosstab(dfSimp.age, dfSimp['default.payment.next.month']).plot(kind="bar",figsize=(15,6), title = "Payment analysis by age")
plt.xlabel('Age')
plt.xticks(rotation=0)
plt.legend(["Paid", "Default"])
plt.show()
pd.crosstab(dfSimp.sex, dfSimp['default.payment.next.month']).plot(kind="bar",figsize=(15,6), title = "Payment analysis by sex")
plt.xlabel('Sex')
plt.xticks(rotation=0)
plt.legend(["Paid", "Default"])
plt.show()
pd.crosstab(dfSimp.marriage, dfSimp['default.payment.next.month']).plot(kind="bar",figsize=(15,6),title = "Payment analysis by marital status")
plt.xlabel('Marital status')
plt.xticks(rotation=0)
plt.legend(["Paid", "Default"])
plt.show()
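The bar charts above compare raw counts, which makes groups of very different sizes hard to compare. One possible complement, sketched here on made-up toy data rather than the notebook's results, is to normalize each crosstab row so that it shows the share of defaults within each group.

```python
import pandas as pd

# Toy data; the values are illustrative and not taken from the dataset
toy = pd.DataFrame({
    "education": [1, 1, 1, 1, 2, 2, 2, 2, 2, 3],
    "default.payment.next.month": [0, 0, 0, 1, 0, 0, 1, 1, 0, 1],
})

# normalize="index" divides each row by its total, giving per-group rates
rates = pd.crosstab(
    toy["education"],
    toy["default.payment.next.month"],
    normalize="index",
)
print(rates)
```

The column labelled 1 is then the default rate for each education level, and the same call works with `sex`, `marriage` or `age` as the row variable.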
# seaborn is imported as `sn` at the top of the notebook
sn.catplot(x="default.payment.next.month", y="limit_bal", data=dfSimp)
sn.catplot(x="education", y="limit_bal", data=dfSimp)
sn.catplot(x="default.payment.next.month", y="prom_bill_amt", data=dfSimp)
After the cleaning operations, the dataset has 23241 observations and 19 variables. Of these variables:
Analyzing the qualitative variables, we can note certain important characteristics, such as:
From the quantitative variables we can observe the following:
Crossing variables, we can observe the following: